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Abstract 

Observational time series data often exhibit both cyclic temporal trends and autocorrelation and may also 
depend on covariates. As such, there is a need for flexible regression models that are able to capture these 
trends and model any residual autocorrelation simultaneously. Modelling the autocorrelation in the residuals 
leads to more realistic forecasts than an assumption of independence. In this paper we propose a method 
which combines spline-based semi-parametric regression modelling with the modelling of auto-regressive 
errors. 

The method is applied to a simulated data set in order to show its efficacy and to ultrafine particle number 
concentration in Helsinki, Finland, to show its use in real world problems. 

Keywords: autoregressive errors, semi-parametric regression, Bayesian inference, generalised additive 
model, ultrafine particles, aerosols 



1. Introduction 

Continuously measured time series may exhibit 
regular temporal trends, non-linear dependence on 
covariates (and interactions thereof) and autocorre- 
lation. This motivates the desire for flexible regres- 
sion models which are able to take these features 
into account without specifying the functional form 
of the relationship a priori. 

The use of splines for non- and semi-parametric 
modelling smooth curves and surfaces (Silverman, 
1985), time series (Waliba, 1990) and non- linear co- 
variate effects (Lin and Zhang, 1999) is well estab- 
lished (Ruppert, Wand and Carroll, 2009). Splines 
have simple to construct bases (dc Boor, 1978) and 
can be extended to include smoothness penalties, 
periodic bases, and interactions (Eilcrs and Marx, 
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2010). The flexibility of splines has led to their 
adoption as bases for Generalised Additive Mod- 
els (Lin and Zhang, 1999; Lang and Brezger, 2004; 
Wood, 2006). 

Harvey and Koopman (1993) describe a spline 
based model which flexibly models periodic trends 
by allowing the spline coefficients to evolve accord- 
ing to a random walk. This model admits the use 
of covariates but the error terms are assumed inde- 
pendent and identically distributed. 

An important consideration when seeking to fit 
a model for forecasting time series data is that 
the residuals may display some degree of autocor- 
relation. Rather than fitting the mean predictor 
with an assumption of independent and identically 
distributed residuals and then performing post hoc 
analysis of the residual autocorrelation, including 
the autocorrelation in the modelling will ensure 
that the estimates of the model parameters have 
taken the autocorrelation into account, leading to 
more realistic forecasting (Club, 1993; M0lgaard, 



Hussein, Corandcr and Hamcri, 2012). 

Traditional time series approaches (e.g. Box and 
Jenkins, 1994) are able to account for error struc- 
tures other than independent and identically dis- 
tributed errors through the specification of the 
Moving Average component of an ARIMA model 
(Venables and Ripley, 2002). Seasonality can be 
modelled in the ARIMA framework by including a 
term of the form (1 — L*), where L is the lag opera- 
tor and s is the length of the period in terms of the 
frequency of the time series. The inclusion of this 
term accounts for seasonality by effectively remov- 
ing the seasonal trend at lag s. Seasonal decom- 
position with loess (Cleveland, Cleveland, McRac 
and Tcrpcnning, 1990; Cleveland, Grosse and Shyu, 
1993) can perform interpolation and smooth esti- 
mation of seasonal trends with local polynomial re- 
gression. A common use of these seasonal decom- 
position models is to seasonally adjust a time series 
in order to examine the residual trends (Findlcy, 
MonseU, Bell, Otto and Chen, 1998), rather than 
examining the temporal trends themselves. Fore- 
casting from these seasonally decomposed local re- 
gression models is possible, with ARIMA modelling 
providing the forecasts of the departures from the 
seasonal patterns. 

One of the most flexible and easily implemented 
bases for the Generalised Additive Model is the 
penalised B-spline (Eilers and Marx, 1996). De- 
spite the development of the GLM with autocor- 
related errors (Chib, 1993) and the simplicity of 
the B-spline there does not appear to be an at- 
tempt made to incorporate both of these modelling 
approaches. A Bayesian nonparametric regression 
model with autocorrelated errors was proposed and 
implemented by Smith, Wong and Kohn (1998) 
which uses cubic regression splines rather than pe- 
nalised splines. Penalised B-splines are attractive 
because when a large number of spline basis func- 
tions are used, any excessive wiggliness is penalised 
and a smoother fit obtained. 

In this paper a method which combines spline- 
based semi-parametric regression modelling with 
the modelling of auto-regressive errors is proposed. 
It is possible to define a model for autoregressive 
(AR) errors in WinBUGS (Lunn, Thomas, Best 
and Spicgelhaher, 2000; Hay and Pcttitt, 2001) but 
incorporating them with complex spline-based re- 
gression fixed effects terms is cumbersome in this 
modelling environment (Crainiccanu, Ruppert and 
Wand, 2005). The approach outlined below is a 
combination of the work of M0lgaard et al. (2012) 



and Clifford, Low Choy, Hussein, Mengersen and 
Morawska (2011). 

Section 2 features a review of the Generalised 
Additive Model (GAM) and autocorrelated errors. 
The use of penalised splines is outlined, including 
how to construct penalties for multivariate bases 
and how to ensure the identifiability of the model 
(as the spline basis has full rank). Gibbs sampling 
for the spline coefficients and parameters for the au- 
toregressive errors are discussed, as is the Metropo- 
lis sampler for the hyperprior which penalises the 
wiggliness of the splines. Forecasting with AR er- 
rors is also discussed, including posterior checks for 
predictive performance. 

In Sections 3 and 4 the model is applied to two 
data sets. The first is a set of simulated time se- 
ries data with a two dimensional (2D) covariate and 
autocorrelated errors added. This shows that the 
model recovers the temporal trend, autocorrelation 
of the errors and the 2D covariate effect. The pos- 
terior covariance of the sampled residuals are anal- 
ysed to show the reduction in autocorrelation. The 
full regression and forecasting model is applied to 
observations of PNC and meteorology recorded in 
Helsinki. 

In Section 5 the features of the proposed model 
are reviewed in light of the case studies and sugges- 
tions made for extensions to the modelling approach 
presented. 

2. Methodology 

Bayesian regression modelling can be conceptu- 
alised as starting with a set of assumptions about 
model parameters (prior belief) and using collected 
data to update those assumptions (to obtain the 
posterior belief). Mathematically, the prior beliefs 
about the parameters, 6, such as their mean and 
variance, are represented as a distribution, p{0). 
The regression model and data are represented by 
a distribution where the likelihood of the data (X) 
is conditioned on the parameters, p(X | 0). The 
posterior is obtained, through Bayes' rule by mul- 
tiplying the prior and the likelihood, p (0 | X), and 
represents a probability distribution for the param- 
eters, conditioned on the observed data ^ . 

Chib (1993) specifies a Bayesian linear model in 
which the responses have either a Gaussian or t 



^¥or a more thorough review of Bayesian statistics, see 
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likelihood and the residuals are assumed to be auto- 
correlated rather than independent and identically 
distributed. In this section, an additive model with 
autocorrelated errors is developed by using splines 
to estimate non-linear covariate effects (Hastie and 
Tibshirani, 1990). The benefits of spline regression 
over polynomial regression have been discussed by 
Ruppert et al. (2003) and Crainiceanu et al. (2005). 

Current software packages for Bayesian semi- 
and non-parametric Bayesian regression, such as 
R-INLA (Rue and Martino, 2010), mgcv^ (Wood, 
2011) and BayesX (Belitz ct al., 2009), do not deal 
with autocorrelated residuals. 

The methodology is implemented in MATLAB. 
The regression model is fit to the data which is 
observed and whose lagged data has also been ob- 
served. That is, we omit any data where any of 
yi or its lagged values are missing (represented in 
MATLAB as NaN, "not a number"). 

2.1. Generalised Additive Models 

Generalised Additive Models (GAMs) can be 
treated as equivalent to semiparametric regres- 
sion with the Generalised Linear Mixed Model 
(GLMM) where the nonparametric smoothers, such 
as splines, are the random effects (Ruppert, Wand 
and Carroll, 2003; Fong, Rue and Wakefield, 2007). 
Alternatively, the basis matrix for the parametric 
part of the model can be augmented with the basis 
matrices for said smoothers, providing for the con- 
ception of the GAM as replacing the linear terms 
in the GLM with additive smoothers (Hastie and 
Tibshirani, 1990). 

A GAM can therefore be expressed as 

M» = /3o + /l {Xil) + /2 {X2i) + ■■■ + Im [xMi) 

for M covariates, Xm* , each with their own univari- 
ate spline, fm(xm*), and a link function, g{-), as 
in the GLM. The design matrix of the GLM, X, 
now represents the spline design matrix. It has as 
many rows as there are observations and X]m=i '^m 
columns, where dm is the number of basis splines 
used in the construction of fm{-). 

The GAM can thus be written in a form which is 
equivalent to the formulation of a GLM, but with 
splines replacing the linear terms in the predictor, 

g iy) = X/3 + e. (1) 



^while not strictly Bayesian, mgcv gives Bayesian sum- 
maries 



2.2. Penalised splines 
2.2.1. B-splines 

The B-spline basis is built according to a recur- 
sive algorithm (dc Boor, 1977; Eilers and Marx, 
1996). A group of step functions, each of which 
is non-zero between two adjacent knots, tj-i and 
<j, are operated on to build a group of continuous 
linear functions which are non-zero between three 
adjacent knots such that the basis functions over- 
lap. This process is repeated until the desired basis 
order is obtained. 

The order k B-spline for covariate x 

Bj,k{x) _ x-tj Bj^k-i{x) 

^j-\-k ~ tj-\-k—l tj-\-k—l 

tj + k — tj tj+k — + 1 

on a grid of knots t with Bjfl{x) = 1 between knots 
tj-i and tj. That is, the zero-order B-spline basis 
is constant between successive knots and higher or- 
der bases are created according to the recurrence 
relation. 

A cubic B-spline basis function consists of three 
quadratic polynomials which are piecewise continu- 
ous and smooth in the intervals between four neigh- 
bouring knots and the basis outside these knots is 
identically zero. Each B-spline basis function has 
compact support (i.e. are non-zero only on an open 
interval) and so allow the flexible modelling of non- 
linear effects by responding to local changes in the 
relationship between the corresponding covariate 
and the response. 

The basis splines are calculated once for each 
non-linear function to be approximated and so the 
choice of the order of spline doesn't affect the com- 
plexity or speed of the calculations in the MCMC 
simulation, although the number of knots chosen 
will do so. The choice of a particular order of spline 
basis can therefore be made according to the desired 
continuity, smoothness and differentiability proper- 
ties of the fitted smooth. A B-spline basis function 
will have non-zero derivatives up to and including 
the order of the basis spline, e.g. a first order spline 
will be once differentiable. 

Figure 1 shows ten first order B-splines, ten sec- 
ond order B-splines and a linear combination of the 
same to approximate the function y — sin (2ttx'^^ . 
The first order splines are piecewise continuous and 
that the second order splines are piecewise smooth. 
Each individual basis function is coloured a differ- 
ent shade of grey; the colouring scheme is consistent 
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across each of the four figures. The approximation 
is shown as a thick grey line and the true function 
is shown as a thick, black, dashed line. 





Figure 1: Example B-splines. Top row: ten first 
order B-splines, ten second order B-splines. Bot- 
tom row: Use of B-splines to approximate y = 
sin (27ra;^) where the true function is a dashed black 
line and the approximation is a solid grey line. 

Figure 2 shows a 2D tensor basis of B-splines. 

A cyclic B-spline basis can be defined by adding 
the basis splines with non-zero value at the edge of 
the univariate covariate space such that the basis 
is piecewise continuous at the boundary and any 
order preserving permutation of the knots on the 
periodic covariate space results in a valid B-spline 
basis. 

2.2.2. Penalisation 

A smoothing penalty can be imposed to ensure 
that the fitted B-splines are not too "wiggly" . Fil- 
ers and Marx (f996) introduce the penalised B- 
spline, or "P-spline", in a frequentist context, mod- 
ifying the normal equations to include a penalty 
matrix based on discrete differences between the 
spline coefficients. 

Let D = D(fc) be a discretised operator equiv- 
alent to the fcth derivative of the identity matrix 
with dimension equal to the number of coefficients 
in the B-spline basis of interest. This matrix will 
be rank deficient, with the deficiency in rank being 
equal to the order of the derivative. Applying the 
differential operator to a vector /3 e M'*, where d is 
the number of basis elements in a univariate spline 
whose coefficients are /3, yields a vector with a value 
of the discretised value of that differential operator 






Figure 2: Tensor product of two B-splines. xi is 
modelled with four second order B-splines, X2 with 
five second order B-splines; the tensor product thus 
has 20 basis elements. Shading is according to the 
value of the joint basis. Each image represents a 2D 
basis spline in the joint basis as a matrix, the spline 
basis matrix will consist of these matrices reshaped 
as column vectors. 



evaluated at (3, and has a dimension equal to the 
number of rows of the differential operator, i.e. for 

As D/3 is the application of the discretised differ- 
ential operator, l"^D/3 = Y^'jZi ^i-*f^^ is the sum of 
the discretised derivatives. The continuous analogy 
is / /W(a;)dx. Then 



/3^D^) (D/3) = / f/C^)] Ax 



and is the discretised sum of the squares of the 
derivatives (Wood, 2003). The quantity to be min- 
imised in the penalised B-spline, then, is the sum 
of the square of the derivative (Filers and Marx, 
1996). 

Lang and Brczgcr (2004) reformulate this penalty 
matrix as a zero mean conjugate multivariate nor- 
mal prior on the B-spline coefficients, /3 e M*^^^, 

(/3| A)(xA('^-'=)/2exp('-^/3^K/3') (3) 



A^r(l,6) 



(4) 
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where A is the smoothing parameter which pe- 
nahses the "wiggUness" of the resuhing sphne and 
K € K'^^'' is the matrix form of the discretised wig- 
ghness penahy as in Eilers and Marx (1996). The 
parameter for the scale of the prior on A, b, is cho- 
sen such that the prior has a smaU variance and 
is concentrated aromid A = 0. The prior for /3 is 
improper, but a non-diffuse prior on A ensures that 
the posterior for j3 is proper. The rank deficiency 
of the penalty matrix is overcome by adding a small 
amount to the diagonal of K of no more than about 
0.001 (the diagonal values of K are at least 2, the 
first order penalty case). 

There are other ways to penalise the B-spline 
coefficients than the P-spline approach of Eilers 
and Marx (2003). These include the approach of 
O'Sullivan (1988) and its Bayesian analogue, devel- 
oped by Wand and Ornierod (2008). 

2.2.3. Interaction terms 

Harvey and Koopman (1993) describe a spline 
based model which is able to flexibly model peri- 
odic trends by allowing the spline coefficients to 
evolve according to a random walk. This model is 
applied to a joint model of daily and hourly trends 
and treats the coeflicients of the daily trend spline 
as evolving according to a random walk model from 
day to day. While this is an obvious choice for mod- 
elling an interaction between a factor and a con- 
tinuous covariate, it does not make sense for two 
continuous covariates. 

A tensor product of the bases for two univariate 
B-spline matrices can be constructed in order to fit 
an interaction term of those covariates (Eilers and 
Marx, 2003; Marx and Eilers, 2005). Penahsation 
of the coefficients for this tensor basis is achieved by 
forming the Kronecker tensor product, denoted ®, 
of the identity matrix with size equal to the number 
of basis elements and the univariate penalty matrix 
for each univariate spline. This leads to a formu- 
lation of the smoothing penalty prior in terms of 
two penalty matrices, one along each axis of the 
covariate space corresponding to each of the two 
covariates. The prior takes the form 

(/3 I Ai, A2) cx exp (^-^/3^Pi/3 - ^fj^V^fi^ 

Ai,A2 -r(i,6) 
P2 = Id, (E) K2 

(5) 



for di , d2 the number of basis elements for covari- 
ates 1 and 2 respectively. The matrices Pi and P2 
are square matrices of dimension c?i x d2, the first 
of which is block diagonal, with each block com- 
posed of tridiagonal matrices, corresponding to the 
penalties on the first covariate's basis (indexed se- 
quentially). The second, P2, has a banded struc- 
ture, corresponding to the penalties on the second 
covariate's basis (where the indexing is sequential, 
di at a time). An example of a 2D tensor product 
of B-splines is shown in Figure 2 in the appendix. 

The above tensor product formulation with a 
penalty for each univariate basis refiects the pos- 
sibility that a different amount of smoothing may 
apply in each direction, i.e. that smoothing is not 
isotropic. While R-INLA provides a rw2d GMRF 
latent model for two-dimensional smooth terms, the 
single precision parameter of the GMRF (equivalent 
in formulation to a univariate P-spline prior) does 
not reflect that there may be a different amount of 
smoothing required in each direction. This multiple 
penalty is addressed in BayesX, along with an inter- 
action between the smoothing terms (Bclitz et al., 
2009). 

The random walk time evolving spline of Harvey 
and Koopman (1993) can be recovered by using a 
first order random walk P-spline penalty matrix as 
the precision matrix for a factor term. 

2.3. Parametric terms 

In this section, the term "parametric" denotes 
regression basis functions with a parametric form 
specified a priori. These parametric terms, such as 
linear terms, polynomials and sinusoidal functions, 
will be modelled as fixed effects with Gaussian pri- 
ors (multivariate where appropriate), according to 
the method outhned by Chib (1993). 

In section 2.2.1 the periodic B-spline basis was 
briefly discussed. Another choice of basis for 
semi-parametric modelling of periodic functions is 
the Fourier series basis, consisting of sin(^J^), 
cos (^f^), c = 1 . . . C. The full Fourier series ba- 
sis also includes an intercept term which should be 
omitted from the regression model to ensure iden- 
tifiability of the overall mean. In the case of the 
splines section 2.6.1 discusses how identifiability is 
enforced by centring the splines; the sin and cos 
terms are already centred. 

The sinusoidal functions may be treated as any 
other parametric (e.g. linear) term in the regres- 
sion equation. It is appropriate to assume that the 
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Fourier coefficients are uncorrelated since by defini- 
tion the Fourier basis is orthogonal. An appropriate 
prior, therefore, is a weakly informative multivari- 
ate normal with zero mean and a covariance matrix 
whose structure is diagonal. 

Heiler and Feng (2000) provide an example of 
using the Fourier series basis to approximate tem- 
poral trends. As with all variable selection prob- 
lems, the question is how many terms must be in- 
cluded? A difficulty of using Fourier series is that 
using too few terms constrains the "wiggliness" of 
the resulting approximated function whereas using 
terms with too high frequency may lead to spurious 
oscillations unless the prior shrinks the coefficients 
of higher frequency oscillations to zero as in Lenk 
(1999). Instead of this approach, the use of peri- 
odic B-splines with a smoothing penalty is adopted. 
Here it is merely pointed out that the Fourier ba- 
sis may be used, as was done by M0lgaard ct al. 
(2012). 

2.4- Factor terms 

Factor terms may arise as random effects in a 
GAMM, e.g. a site-specific mean in multi-site data, 
and may be treated by coding all levels of the fac- 
tors to an integer, as in R-INLA. For a factor basis 
matrix with sequential integers indexing the fac- 
tors, F G M"^'^, for covariate x g M", the element 
F,j = lifx, =i(jel,2,...,J). 

For the coprecision of the factor term, Qp, it is 
simple to assume that the factors are independent 
and identically distributed, so that Qf = A^I for 
precision parameter Xp. If there is some known 
structure in the factors or if a factor is being used to 
describe a zero order spline (constant between knots 
with knots spaced between factor indices) then a 
smoothing penalty prior may be used, as for a P- 
spline. 

The factor terms are centred around zero with 
the same identifiability constraint as in 2.6.1. 

2.5. Autocorrelated residuals 

Many time series are highly autocorrelated. Any 
temporal variation not explained by a periodic 
spline basis (e.g. hour of day, day of year) in the 
proposed model can treated by modelling the time 
series of errors, £i, with an autoregression model, 



0(L)ei = u^ 

u, - TV (0,0-2) 



(6) 



where L is the lag operator and 0(-/j) is a p-deg: 
polynomial in the lag operator 



;ree 



<j){L)xi = Xi + (f>lXi-i + (t)2Xi-2 + • • • + <PpXi^p 

describing the autocorrelation structure of the 
residuals in 1 (Box and Jenkins, 1994). 

Chib (1993) suggests the use of a multivariate 
normal prior for the coefficients of the above poly- 
nomial. A censoring condition, that the absolute 
sum of the coefficients is strictly less than one, is 
applied in order to ensure that the error process is 
stationary. The prior is 



< 1. 



where the prior mean, ^q, and precision, $0 G 
]^pxp^ are set such that the prior for the autoregres- 
sion parameters is weakly informative. The prior 
for the variance of the autoregressive errors is a con- 
jugate inverse Gamma, 



By using an autoregressive error structure, fore- 
casting from the model will give a more realistic 
representation of the relationship between succes- 
sive observations/forecasts. If the residuals are in- 
dependent, this will manifest as the zero vector be- 
ing contained within the multivariate credible set 
of (i>. 

By sampling the residuals in the MCMC scheme 
and using these for forecasting, a distribution for 
each forecast is readily obtainable. This will be 
especially important when making a forecast which 
is dependent on a previously forecast observation. 
Ignoring the uncertainty in the distribution of the 
forecast value yt+i would produce an estimate of 
yt+2 which assumed that jjt+i was observed rather 
than forecast. 

2. 6. Estimation via MCMC 

The form of the GAM with log link, Gaussian 
likelihood and autocorrelated residuals is 



log Vi ^Hi+Ei 
4){L)£i = Ui 

u, -AA(0,o2) 
where X, ^ is row i of matrix X. 



(7) 
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A model for a Gaussian hierarchical linear model 
with autoregressive errors is given by M0lgaard 
et al. (2012), following the formulation of Chib 
(1993). The priors for cr^, f3 and 0, in this for- 
mulation are 

.'~XS(|,|) (8) 

0cx7WVAA(0o,$o-i) : |l^0| < 1. 

The terms (f>o and /3q are the respective mean val- 
ues for the weakly informative priors for (f> and f3, 
taken to be in each case. The values for vq and Sq 
are taken to be —k and 0, respectively, as in Chib 
(1993). The prior precision matrix for the autore- 
gressive error model, $o, is set as 10~^Ip so that 
the prior is weakly informative, such. The prior 
precision matrix for (3, Aq is discussed below. 

Chib sets the prior precision of /3, Ag, as a scalar 
multiple of an identity matrix such that the prior 
is weakly informative. Here, however, Aq is a block 
diagonal with each block containing a smoothness 
penalty precision matrix, AjKj, for a spline term. 
The A are therefore hyperparameters for the prior 
on (3. Any parametric terms to be included in the 
model can be treated similarly, by including their 
prior precision as a series of individual elements on 
the diagonal (in the case that they are indepen- 
dent), a block consisting of a diagonal matrix (for 
sets of independent, identically distributed param- 
eters) or a block structured as a symmetric positive 
definite matrix (if there is some known correlation 
structure). The prior for (3 is now is very infor- 
mative due to the spline penalties contained within 

Aq. 

The posterior distributions for the linear predic- 
tor terms, /3, autoregressive coefficients, <p, and 
model variance, ct^ are given in Table 1. 

The matrix X*, and the vector y* are formed by 
applying the general lag polynomial 1 — (f){L) to X 
and y respectively, reducing the number of rows by 
p, the number of lags included in the autoregressive 
residual term. That is, 

X = Xi...„_p_* — (j)i^2...n~p+l,* — 

4'2'^3...7i-p+2,* ... — (l>pXp+l...n,* (12) 

and similarly for y* . 

E S K("~p)^p contains, as its columns, the resid- 
uals, e, at each of the p lags, such that the i*^ row 



is [^r^—i, . . . , p], 

E = [e I i£ I L^e I • • • I L^e] . (13) 

In 10, ll'llj represents the Euclidean 2-norm. 

Rather than first fitting the linear predictor with 
the assumption of independent, identically dis- 
tributed errors and then fitting an autoregressive 
model for the residuals so that (p is conditioned on 
j3 but not vice versa, this comprehensive model al- 
lows simultaneous estimates of /3 and <p. 

The directed acyclic graph of the model is given 
in Figure 3. Sampling is performed by successively 
sampling from the marginals of /3^*^ from 9 (then 
correcting /S*-*-* according to 14 and 15), </>^*'' from 

11, fr om 10 and sample A*-*-* from 17 and 18 

with a Metropolis-Hastings step. 

2.6.1. Identifiability of spline coefficients 

A d-dimensional spline basis B will have full rank 
and therefore the matrix [1|B] e M"^ is rank 
deficient. In order to enforce identifiability the B- 
spline parameters are centred at each step of the 
Gibbs sampler such that the sum of the marginal 
effect of that spline is zero (Wood, 2003) and cen- 
tering constant is transferred to the intercept term 
in the model, /Jq- 

At step t in the Gibbs sampler, for each block of 
the covariate matrix X e M^^'' corresponding to a 
B-spline, indexed within X by ind we set 

where the centring constant, is calculated as 
that which satisfies 

N 
1=1 

(5^= ^^*-'"'^^i"d ^ (15) 

2.6.2. Lack of support 

It is possible that B-spline knots will be placed 
in a region of the covariate space where there are 
no observed values, especially when dealing with 
interaction terms. For a basis spline defined in a 
region of no support (i.e. there are no observations 
in the relevant region of the covariate space) the 
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where 



(/3 I y, a\ 0) ^ MVU [hp (Ao/3o + X*^y*) , a^kp) 



-p + i^o + fc <5o + Q/J + l|y* - x*/3|| 



2 ' 2 



(0 I y, /3, (t2) ^ MVN (A^ ($o'/'o + ^r-^E^e) , A 



A„ = Ao+X*'X^ 



= {f3- f3o) Ao (/3 - /3o 



A^= (^o+'T-^E^E 



(9) 

(10) 
(11) 



Table 1: Posterior densities for GAM with AR residuals 



j=l..nsplines 




Figure 3: Directed acyclic graph of the model in 7 with priors 8, 3, 4 and 5. 



posterior of the corresponding coefficient is unde- 
fined. In these cases, a sample will be taken from 
the prior. Due to the informative nature of the 
penalty prior, the posterior variance of this param- 
eter will be much smaller than if the spline was 
unpenalised. The corresponding marginal effect of 
the entire spline will have a wider credible interval 
in the region of this spline's support. 

An alternative is to use a spline basis with global 
support for the covariate responsible for the lack of 
support, such as the semi-parametric low rank thin 
plate spline described by Ruppert et al. (2003) and 
Crainiceanu et al. (2005) as 

J K 
f{Xi) = ^Pjxl + ^lk \X^ - KfeP 
3 = 1 fc=l 

where J < m is the order of the parametric fixed 



effect polynomial, m is the order of the random ef- 
fects polynomial splines and the are knots placed 
at the k/{K + 1) quantiles of x. A suitable prior for 
(/3i, . . . ,/3j,7i, . . .^k) is a weakly informative mul- 
tivariate normal with an identity precision matrix. 

2.6.3. Sampling spline penalties 

In order to sample the P-spline penalty correctly 
we need to extend the Gibbs sampler to also sam- 
ple the A, which are hyperparameters for the prior 
distribution for j3. The penalty matrices K and 
P, suggested by Lang and Brczger (2004), however, 
have a determinant of zero and thus the posterior 
of A is undefined. We avoid this by adding a small 
value to the diagonals of these matrices to ensure 
that they are non-singular and thus invertible, e.g. 
the penalty becomes A (K + lO^^l). 

For the one-dimensional splines it is rather simple 
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to find the conditional distribution of the A hyper- 
parameters when K e R'^^'^ has non-zero determi- 
nant. The posterior is derived as 



p(A|/3)cxp(/3| A)p(A) 



(16) 



= MVJV(f3 0, (AK)"Mr(A I 1,6) 



/det(AK) 



exp 



A 



/3' K/3 be 



V (S^r) 

cx A''/^ gxp (^->^(b+ ^/3'^K/3^ ) 
oc r I A 



^ + l,6+l/3^K/3). (17) 



That is, the Gamma prior, 4, is updated by the data 
in such a way that some smoothing is enforced by 
the dimension of the sphne but the scale param- 
eter, which controls the amount of smoothing (or 
how likely the smoothing is), is shifted according 
to the second difference of the parameters. It is 
still possible to have zero smoothing, but the pos- 
terior density is small, so that a lack of smoothing 
must be justified by the data. 

For two dimensional splines the sampling of A 
would be as simple as the univariate case if the 
smoothing was set a priori to be the same for each 
covariate in the tensor product. It is more valid, 
though, to assume that this is not the case and 
that there may be more smoothness in one direc- 
tion than the other, implying two different A pa- 
rameters. The conditional distribution of these pa- 
rameters, assuming that the same hyperpriors for 
the Gamma prior are used for each spline, is 



X 



p{Xi,\2 I /3) « Vdet (AiPi+A2P2)(AiA2)""' 

g-i/3'^(AiPi+A2P2)/3g-h(Ai + A2)^ (-;^8) 

Because the conditional distributions of Ai and 
A2 are not easy to obtain, we sample them with 
a Metropolis-Hastings step (Hastings, 1970) inside 
the Gibbs sampler. 

2.6.4- Credible intervals 

Credible intervals of parameter estimates can be 
calculated quickly from the MCMC samples. For 
any parametric terms, 95% credible intervals will be 
reported. For the coefficients of splines, the credible 
interval for the entire marginal effect is of interest 
and so individual credible intervals for each param- 
eter are not so directly interpretable for the spline 
terms. 



2. 7. Forecasting 

The model forecasts in a sequential manner, con- 
ditioning each forecast on the observed values and 
forecast values preceding it according to the autore- 
gressive structure. The uncertainty in forecast val- 
ues will accumulate, as discussed in Section 2.5. 

In Section 3.2, to take advantage of the growing 
set of training data the model is refitted every 20th 
day. For each day, forecasts are produced one time 
hour at a time for the coming forty-eight hours, 
midnight to midnight, using the most recent fit of 
the training data and the residuals, e, for the past 
week. These residuals are assumed to be available 
until noon. The forecast values for each observation 
made 24 and 48 hours in advance will be stored. 

The modelled value of each forecast is obtained 
by calculating p, = X/3 for the values to be forecast, 
based on the observed covariates (in practice, these 
may be known and the forecasting treated as im- 
putation, they may be forecasts themselves from a 
model for the covariates, may be generated from a 
prior, etc.); calculating the residuals, e, for the week 
before the forecasting is to start; and construct- 
ing autocorrelated forecasting errors from the esti- 
mates of 4>. These are all constructed from MCMC 
output, so distributions of these parameters are all 
available. The forecasts, yi, are calculated from the 
MCMC iterations so that 



(*) 



(t) 



e) ■ 



Because both Jli and Ei are normally distributed, yi 
is also normally distributed. The posterior density 
of the forecast values, then, is the empirical density 
of the samples and is asymptotically normal as the 
number of samples increases. 

For further information, see Sections 2.1.3 to 
2.1.5 of M0lgaard et al. (2012). 

2.8. Posterior checks 

Convergence of the MCMC chains can be checked 
by examining the trace of the Markov chains and 
that the posterior kernel density estimate is uni- 
modal and that its shape is appropriate to the dis- 
tribution from which it was drawn. 

As the posterior for a fitted smooth is a multivari- 
ate normal it is important to remember that even 
though the 95% CI for a B-spline coefficient (which 
is marginally normally distributed) may contain 
zero, it is the joint effect that we are interested 
in. Indeed, because the spline is centred around 
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zero, it is not unexpected that zero would be con- 
tained in the credible interval of some coefficients. 
It is more informative to look at whether zero is 
contained within the credible interval for the entire 
smooth and therefore the spline can be said to be 
identically zero. 

Plots of the posteriors of the marginal effects are 
a very informative way of both visualising splines 
and checking that the output of the model is con- 
sistent with the physical system that the statistical 
model is attempting to represent. 

For plotting, we calculate a new basis with evenly 
spaced covariate values between the extreme values 
of each covariate in the data. This gives a basis 
with the same knots as the basis used for fitting 
the model but with a set of covariates that will 
give a visually more pleasing plot. The mean and 
95% credible regions for these posteriors are calcu- 
lated by using the output from the Gibbs sampler. 
Bivariate splines will be plotted as surface and/or 
contour plots. 

For a Generalised Linear Mixed Model, which en- 
compasses a GAM with splines for random effects 
(Fong et al., 2007), 



y 



-1 'X/3 + Z7 



the fitted values are given by 
y =y$ + Z7 



where X = 



X 



inx (Z+fc) 



(for I parametric 



terms and k non-parametric basis functions) and 



0;x/ 0;xfc 

Ofcx/ a2(cov(7))-i 



(19) 



is the posterior precision of (3 from 9 (Ruppcrt 
et al., 2003). 

The total effective number of degrees of freedom 
of the model is given by the trace of the hat matrix 



H = (X^X + A) ^ X^X 



(20) 



and for each non-parametric term, the trace of the 
block matrix of the hat matrix corresponding to the 
appropriate coefficients gives the effective number 
of degrees of freedom for that non-parametric term 
(Fong et al., 2007). 

Rather than estimate the mean effective degrees 
of freedom post hoc with the posterior means of the 



parameters we can use the samples from the MCMC 
chains as they are drawn to construct H. In this 
way, we obtain the distribution of the number of 
parameters for each term by examining the density 
estimate of the trace of the blocks of H. 

The Deviance Information Criterion (DIG) is a 
Bayesian analogue of the Akaike Information Gri- 
terion and penalises the deviance, a measure of the 
goodness of fit of a model, by the effective num- 
ber of parameters in the model (Spiegelhalter et al., 
2002). The effective number of parameters can be 
compared to the effective degrees of freedom (de- 
scribed above) . The effective number of parameters 
is based on the deviance and so the DIG does not 
require the integrating out of random effects param- 
eters like Schwarz's Bayesian Griterion (Schwarz, 
1978). 

While calculation of the credible interval for <p 
will provide information about the degree of au- 
tocorrelation of the residuals, the autocorrelation 
function of both e and u can be used to check how 
much autocorrelation remains. The posterior co- 
variance matrix from the chains of e and u can also 
be used to characterise the remaining autocorrela- 
tion of the residuals. 

The probability integral transform (PIT) will be 
used to assess the quality of the model fit. The 
PIT is a measure of the cumulative density of the 
forecast values from their predictive distribution 
(Dawid, 1984; Diebold, Gunther and Tay, 1998). 
This predictive GDF, denoted Fi, is to be com- 
pared to an unknown "true" GDF Gi through the 
observed values that the true physical process gen- 
erates. Hi, and the forecast values, y^. The ideal 
forecasting model is achieved when Fi = Gi. A nec- 
essary condition for choosing a forecasting model 
is that the distribution of Fi{yi) is uniform, corre- 
sponding to ideal forecasts. 

The PIT for each forecast value is calculated dur- 
ing the forecasting step as 

FiiVi) = ecdf {yu lii+Ei) 

where ecdf is the empirical cumulative density func- 
tion of the posterior samples of Jli -\- Si from the 
forecasting routine. The PIT is visualised by plot- 
ting the histogram and autocorrelation of Fiijji). If 
the PIT is uniformly distributed, then the values of 
the probability density function corresponding to 
the pit's cumulative density function should be a 
normal distribution with mean and standard de- 
viation 1. The autocorrelation of the PIT should 
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not be significant beyond the number of steps used 
in forecasting. 

Many competing models may have a uniform PIT 
so the uniformity is not a sufficient condition for 
choosing one model over another (Hamill, 2001). 
To remedy this, Gneiting et al. (2007) recommend 
maximising the sharpness of the predictive distribu- 
tion, i.e. choosing forecasts which are highly con- 
centrated about the observed values. The variance 
of the forecast values will be estimated by charac- 
terising their uncertainty with the width of the 95% 
credible interval of the simulated forecasts (which 
are distributed normally). The 95% credible in- 
terval corresponds to 1.96 standard deviations of 
the estimate, so dividing the interval half width by 
1.96 and taking the square will give an estimate of 
Var(y). 

3. Case studies 

To demonstrate the performance of the method- 
ology in terms of its ability to fit smooth covariate 
effects and autocorrelated residuals we provide two 
case studies. The first is a simulation study with a 
univariate cyclic smooth and a bivariate interaction 
term. The second case study involves modelling and 
forecasting ultrafine particle number concentration 
in Helsinki, Finland, a real world data set that ex- 
hibits a significant amount of autocorrelation. 

3.1. Simulation 

As a first case study, to illustrate the method, 
we provide an example with simulated data using 
a Gaussian likelihood, as in Chib (1993). Data is 
simulated as described in 21. Two covariates, x 
and y, are each drawn from a uniform distribution 
and a non-linear interaction term constructed that 
requires that a 2D spline be fit. A sinusoidal func- 
tion to simulate a temporal trend (at evenly spaced 
intervals) is added, as is some autocorrelated noise. 

i =1,2,...,24 

, w sin(^) 
//i = sm [T:xi) (1 - XiDi ) H ^ 

(1 + 0.4L) e, -AA (0,0.1) 

. 

The model for this simulation thus contains: a 
univariate spline for t with a basis of six second 
order cyclic B-splines with a second order penalty 



prior; a bivariate spline for fitting x and y together, 
consisting of the tensor product of two second or- 
der non-cyclic B-splines bases with six basis splines 
each and a second order penalty prior for each direc- 
tion; an intercept term with a weakly informative 
prior; and an AR(1) model for the residuals. 

3.2. PNC in Helsinki 

Ultrafine particles are of interest in a range 
of areas, including physics (Hinds, 1999), urban 
planning (Moschandrcas, 1998) and epidemiology 
(de Hartog et al., 2003; Krewski et al, 2009; HEX, 
2010). In the urban environment it has been es- 
tablished that a significant portion of the ultrafine 
particles come from vehicles (Morawska et al., 2008; 
Virtanen et al., 2006; Pohjola et al., 2007; Hussein 
et al., 2007) and that the particle number concen- 
tration (PNC) varies non-linearly and irregularly 
over the day, week and with meteorology (Meji'a 
et al., 2007; Morawska et al., 2002; Hussein et al., 
2006; Ketzel et al., 2004; Perez et al., 2010; Wu 
et al., 2008; Wehner and Wiedcnsohler, 2003). 

Continuously measured time series of ultrafine 
PNC exhibit temporal trends, dependence on mete- 
orology and autocorrelation (Wehner and Wiedcn- 
sohler, 2003; Perez ct al., 2010; Wu et al., 2008; 
Hussein et al., 2004, 2006; Jarvi et al., 2009). As 
such, the desire for flexible models which take these 
features into account without specifying the func- 
tional form of the relationship has motivated the 
use of the Generalised Additive Model with splines 
as basis functions and Generalised Linear Models 
with Fourier series basis functions. 

In this case study we use hourly averaged size 
fractionated PNC, recorded at the SMEAR-HI sta- 
tion at the Kumpula campus of the University of 
Helsinki using a twin DMPS system (Jarvi ct al., 
2009). Meteorological data was recorded at the 
university campus on the rooftop of the Physicum 
building. For further information on the data 
collection, see sections 2.2-2.4 of M0lgaard et al. 
(2012). 

Four different specifications of the model are fit- 
ted and model choice is made with the DIG. The 
reduction in autocorrelation is analysed and esti- 
mates of the fitted splines are provided. 

The autocorrelation function of the ultrafine 
PNC in Helsinki is shown in Figure 4. This auto- 
correlation motivates the use of splines to estimate 
temporal trends and a model for the residuals which 
can capture any remaining variation. 
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Parameter(s) 


Basis type 


Basis size(s) 


Periodic 


Covariance structure 


Hour of the day 


B-spline 


6 


Y 


2nd order penalty 


Day of the week 


Factor 


7 


N 


iid 


Day of the year 


B-spline 


6 


Y 


2nd order penalty 


Wind direction 


B-spline 


8 


Y 


2nd order penalty 


Wind speed 


Thin plate 


6 


N 


iid 


Temperature 


B-spline 


8 


N 


2nd order penalty 


Traffic 


B-spline 


8 


N 


2nd order penalty 


Relative humidity 


B-spline 


8 


N 


2nd order penalty 


Solar radiation 


B-spline 


8 


N 


2nd order penalty 



Table 2: List of covariates and their univariate bases used in fitting a semi-parametric regression and 
forecasting model to the Helsinki data. 




100 200 300 400 500 600 

Lag 



Figure 4: Autocorrelation function of ultrafine 
PNC in Helsinki for the data described in Section 
3.2. Local maxima occurring at lags which are mul- 
tiples of 24 correspond to a daily trend relation- 
ship, multiples of 168 to a weekly relationship. The 
maximum lag here corresponds to four weeks. The 
dashed lines represent the 5% significance level. 



Table 2 describes the basis functions used in fit- 
ting the regression model. The thin plate spline 
used for wind speed includes a linear fixed effect 
and the random effects are five first order polyno- 
mials, |a:i — A tensor product of the wind speed 
and wind directions is used, with a basis size of 
48. An attempt was made to use P-splines for the 
wind speed term but the presence of splines with 
no support at the extremes of the covariate space 
is undesirable. 

Four models are fit and the "best" model is cho- 
sen with the DIG, calculated at the final stage of the 
model fitting. The first model is the one described 
by Table 2 and the joint wind speed and wind direc- 
tion term described above. The second model re- 
places the univariate splines for temperature, traffic 
and relative humidity with a tensor product of each 
of those covariates with wind direction. The third 
model retains the univariate splines as well as the 



tensor products. The fourth model is the second 
model with the daily trend replaced with a tensor 
product of the daily trend and annual trend in or- 
der to recognise that the daily trend may change 
over the year. 

The annual trend was excluded from the first 
three models as solar radiation and temperature 
exhibit very strong annual trends (FMI, 2011). Its 
inclusion in the fourth model is for the purposes of 
the tensor product with the daily trend. We will 
provide a plot of the fitted tensor as well as the 
marginal annual trend and marginal daily trend. 
These marginal trends are obtained by averaging 
over a prediction basis defined on a mesh consist- 
ing of one copy of each unique combination of time 
of day and day of the year. 

For the autocorrelated residuals, lags at values of 
1 hour, 24 hours and 168 hours are used. These lags 
represent, respectively, an attempt to capture the 
leftover hour to hour variation from fitting the daily 
trend, the leftover daily variation from fitting the 
daily trend and the leftover weekly variation after 
fitting the day of the week factor term. 

The models will be fit to the first three years of 
the four years of data and predictions, including 
the autocorrelated residuals, will be made for the 
following year with the measured covariate values. 

4. Results and Discussion 

Simulation 

MCMC estimation was conducted by drawing 
5000 MCMC samples from the posterior; 500 ini- 
tial samples are discarded as burn in. 

We see (Figure 5) that the periodic term is esti- 
mated accurately by the model, as is the non-linear 
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interaction. We note that the credible intervals for 
the periodic term are quite similar in width across 
the entire covariate domain. This is due to the peri- 
odicity of the basis, uniform spacing of the covariate 
values and that each unique covariate value occurs 
the same number of times. 

The fitted two dimensional spline captures the 
curvature and asymmetry of the simulation func- 
tion (Figure 5). 

Traces and densities of the posterior samples 
drawn from the posteriors of all parameters have 
converged and are unimodal and normally dis- 
tributed except for the standard deviation parame- 
ter whose square is inverse Gamma (Figure 6). 

Figure 7 gives trace and density plots for the 
smoothing parameters, which are Gamma dis- 
tributed. We see that A^^ and have different, 
though not overly so, credible intervals. Had the 
2D covariate been more oscillatory it is likely that 
we would end up with markedly distinct smooth- 
ing parameters (see Filers and Marx, 2003, sec. 6). 
The priors for these smoothing parameters have a 
maximum at zero. The maximum in each of these 
posteriors now occurs at a non-zero value. While 
the parameters are still (approximately) Gamma 
distributed, they are no longer r(l, •). 

The first parameter is the intercept term, /3o, pa- 
rameters 2 to the 37 are the coefficients of the two 
dimensional spline, parameters 38 to 43 are the co- 
efficients of the periodic spline. Parameter 44 is <j>, 
the autoregressive parameter from the AR(1) model 
for the residuals and the final parameter is the 
standard deviation of the independent identically 
distributed errors, whose square is inverse gamma. 
All other parameters are marginally normally dis- 
tributed. 

The contours of the fitted 2D spline show that the 
simulated 2D covariate effect has been accurately 
reconstructed (Figure 8). The estimate for f3o has 
been added to the 2D spline to allow for a more di- 
rect comparison of the values of the contours, as the 
fitted spline is centred about zero but the 2D covari- 
ate effect is not. Note that both the simulated tem- 
poral trend and its corresponding univariate spline 
are centred about zero and the AR errors also have 
a zero mean. The credible intervals for the two di- 
mensional spline are quite wide in regions where 
there are not many observations and at the edges 
of the covariate space. The mean surface of the 2D 
smooth has no local maxima or minima, the pres- 
ence of which would indicate excessive wiggliness. 

Figure 9 shows that the residuals in u have 



1 2 3 




(0.162,0.975) (0.286,1.437) (0.107,0.834) 




Figure 7: Trace and density plots for MCMC sam- 
ples of smoothing parameters for simulated data. 
From left, A^^, Xy and At. 

smaller autocovariance than the sampled values of 
e. The diagonal bands correspond to 24 lags, indi- 
cating that the variation 24 observations apart may 
be modelled by expanding the model for the AR 
residuals to contain lags 1 and 24. Even so, the lag 
24 bands and the background are lighter for u than 
for £ suggesting that modelling the autoregressive 
nature of the residuals has reduced the correlation 
of the posterior samples of the residuals. 

The effective degrees of freedom of the intercept 
are concentrated around 1 with very little variance; 
it would be troubling were this not the case (Table 
3). The 2D spline has between 21 and 27 effective 
degrees of freedom; the basis for this term has 36 
elements, so each basis element requires, on aver- 
age, fewer degrees freedom than would be required 
by the corresponding polynomial basis. Similarly, 
the size of the basis for the temporal trend is six 
basis elements (and would have been eight had we 
not insisted on periodicity of the basis) and the ef- 
fective degrees of freedom of this term is slightly 
less than five. 

4.2. PNC in Helsinki 

All but the first year is forecast both one day and 
two days in advance. The forecast values are stored 
throughout the iterative forecasting such that all 
bar the first year of the data is forecast conditioned 
on the preceding data. 
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Figure 5: Marginal effects of splines (solid), 95% credible interval (dashed) and the true values from the 
simulated data (dots). 



Term 


2.5% 


50% 


97.5% 


Mean 


SD 


Intercept 


1.00 


1.00 


1.00 


1.00 


0.00 


2D 


23.34 


25.61 


27.45 


25.57 


1.03 


Temporal 


4.95 


4.98 


4.99 


4.98 


0.01 


Total 


29.32 


31.59 


33.43 


31.55 


1.03 



Table 3: Effective degrees of freedom summary 
statistics for model fit to simulated data. 



Of the four models fitted, the model with the 
joint annual-daily trend had the lowest DIG (Table 
4). The triangle shape of the PIT is very similar 
for each of the models and indicates biasedness in 
the samples towards slight underprediction (Figure 
10a). The estimates of the variance of the modelled 
values (Figure 10b) indicates that while the models 
provide estimates with similar variances, the joint 
annual-daily temporal trend with tensor product 
meteorology provides more concentrated forecasts. 
This model also has the lowest DIG out of the four 
models fitted. Therefore, this model represents the 
most efficient fit in terms of goodness of fit versus 
model complexity and has the most concentrated 
forecast values out of a group of models which per- 
form similarly under the PIT. All subsequent anal- 
ysis in this section is performed on this model. 

Figure 11 shows the posterior density estimates 
and 95% credible intervals for the mean, standard 
deviation, (a) and (b), and the mean and 2.5% and 
97.5% quantiles for the fitted smooth functions of 



temporal trends and non-linear covariate effects (c) 

The daily trend in PNG varies over the year 
(see (c)) with most days having a peak around 
10am. This trend peaks during the summer pe- 
riod (days 140-250) with a daily peak around 11pm 
and a plateau from approximately 6am to midday. 
The daily trend in PNG in winter has a trough 
at 3am and a peak around midday (when most of 
the day's light occurs). Spring and autumn daily 
trends contain two peaks, one in the late evening 
around 9pm and one in the morning aroung 9- 10am. 
The overall shape of these temporal trends is con- 
sistent with previously reported temporal trends 
(M0lgaard et al., 2012). 

The weekly trend (d) shows a decreased partial 
effect on weekends (days 1 and 7) and after ac- 
counting for these temporal trends, much of the 
remaining temporal variation is explained with the 
Lag 1 autoregressive error although there is still 
some amount of autocorrelation at Lags 24 and 168, 
which is captured by the model. 

The joint effect of wind speed and wind direction 
(e) is roughly linearly decreasing and almost inde- 
pendent of wind direction for winds weaker than 
4m/s. When the wind is blowing above 5 m/s 
there is an observable non-linear interaction of wind 
speed and wind direction such that at 11 m/s a wind 
of angle approximately 170 degrees (3 radians) cor- 
responds to an increase in log PNG and at 30 de- 
grees (0.5 radians) the contribution to log PNG is 
negative. That is, a strong wind can either remove 
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(a) Intercept 



(b) Standard deviation 




(c) Joint daily and annual trend 




Day of the 

(d) Weekly trend 



Wind direction (radians) 

(e) Joint wind effect 



Wind direction (radians) 

(f) Humidity 




id directnn (radians) 

Temperature 



Wind direction (radians) 

(h) Traffic 



400 eOO 800 1000 1200 1400 1600 
Solar radiatnn (W/m^) 



(i) Solar radiation 



Figure 11: Density plots of /3o, cr, temporal trends and meteorological covariate effects. 95% credible 
intervals are represented as black rugplots for parameters and as dashed lines for fitted smooth terms. The 
solid grey lines in the contour plots represent the 2.5% quantiles and the dashed grey lines represent the 
97.5% quantiles. 
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Model 



DIG pD edf rank(X) 



Univariate meteorology 13023 96.809 86.7699 94 

Tensor meteorology 12404 215.022 203.169 262 

Combination of univariate and tensor 12406 218.360 205.405 286 

Tensor meteorology with annual-daily trend 12055 244.166 230.318 292 



Table 4: Summaries of fitting the models with differing treatments of meteorological covariates with respect 
to wind direction and temporal trends. DIG, effective number of parameters, mean effective degrees of 
freedom, and number of columns in the design matrix, X. 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

(a) True 2D covariate 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 



(b) Combined partial effects of 2D spline and 
constant 

Figure 8: Comparison of contours of the simulated 
2D covariate from 21 and the fitted 2D spline. Ob- 
served locations of x and y are plotted as points. 
The 2.5% quantile is represented with long grey 
dashes, the 97.5% quantile is represented with short 
grey dashes. 




(a) Covariance of posterior (b) Covariance of posterior 
samples of autocorrelated er- samples of iid errors, Ui, of 
rors, £i, of each observation each observation 



Figure 9: Covariance matrices of posterior samples 
of autocorrelated errors, e and independent identi- 
cally distributed residuals u, from simulated model. 

particles from the microenvironment or transport 
them from a nearby source. 

Marginalising this joint trend over the year pro- 
vides an estimate of the mean daily trend in PNC 
and vice versa for marginalising over the day to ob- 
tain the mean annual trend (Figure 12. The mean 
daily trend in PNC exhibits two peaks, one at 10am 
and one at 10pm. The mean annual trend shows a 
maximum in summer, when the days are long. 

Marginalising the posterior for the joint daily and 
annual trend (Figure 11), the mean daily trend in 
PNC exhibits two peaks, one at 10am and one at 
10pm. The mean annual trend shows a maximum 
in summer, when the days are long (Figure 12). 

The autocovariance of the posterior samples of 
the residuals for observations 1000 to 1200 are 
shown in Figure 13. The raw residuals, e, ex- 
hibit a noticeable amount of autocorrelation around 
observation 1050, indicating that while the semi- 
parametric regression model for the covariates has 
fitted a smooth annual and daily trend, not all the 
temporal variability has been explained. This resid- 
ual variability is captured with by explicitly mod- 
elling the autocorrelation in the residuals and we 
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see that the posterior covariance of these residu- 
als decreases significantly when examining it, the 
residuals with explicitly modelled autocorrelation. 

The effect of high humidity (f) above 90%, is 
to reduce the total number of particles in the air 
because of precipitation. The effect in the middle 
range is fairly flat with some local peaks. The ef- 
fect of temperature (g) is generally decreasing and 
shows some dependence on wind direction but there 
are no local peaks or troughs. The effect of traf- 
fic density (h) increases steadily up to about 3500 
vehicles per hour with some dependence on wind 
direction. The increase is not as marked at a wind 
direction of about 230 degrees (4 radians), with a 
peak occurring around 140 degrees (2.5 radians). 

Much of the effect of solar radiation Hi is ac- 
counted for by the joint daily and annual trend. 
The estimates of the effect of temperature, wind 
speed and direction, solar radiation and humidity 
(as a proxy for rainfall) and weekly trend show 
very good agreement with those previously reported 
(Clifford et al., 2011). In both the results presented 
here and by Clifford et al. (2011) the effects of tem- 
perature, humidity and wind speed are generally 
decreasing. Exceptions include strong wind speeds 
at certain headings and the positive effect of in- 
creased humidity until 90% as described above. 

The estimates of cf) (Figure 14) do not contain 
zero in their credible intervals. We see a high 
amount of positive autocorrelation at Lag 1 and a 
small, but strictly positive, amount at Lags 24 and 
168. 

By specifying the regression model to include 



autoregressive residuals rather than autoregressive 
PNC, the mean daily trend (Figure 12) has peaks 
which occur at 10am and 10pm rather than at 8am 
and midday. The temporal variation in the model 
presented in this section can be expressed as 

log2/i = / (houri, day of the year J + . . . + 

(1 - 01 - 024 - (/>168) 

while the temporal variation in the model of Clif- 
ford ct al. (2011) corresponds to 

\ogyi = fi (houri) + f2 (day of the year^) + 

fs (logy.-i) + fi (log7;,_24) + ■■■+ei. 

These two specifications are quite different as one 
contains a joint model of daily and annual trends 
and models the residuals autoregressively while the 
other models the residuals as autoregressive. As 
such, they provide quite different estimates of the 
daily trend. The weekly trend is the same across 
both models, with a maximum on Wednesday and 
minium values on the weekend. 

To illustrate the modelling of the autoregressive 
nature of the residuals a contiguous subset of the 
modelled values and residuals was randomly se- 
lected, corresponding to observations 1000 to 1200. 
This is approximately eight days of measurements. 
Analysis of the autocovariance of the posterior sam- 
ples of the residuals can be found in the appendix. 

The effective degrees of freedom for the intercept, 
each spline term and the model overall are given in 
Table 5. Note that the intercept has exactly one 
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Joint annual trend 



Univariate winrt 




(a) Probability integral transforms 




- JtDint annuai trend 

- Univariate wind 
Joint wind 
Tensor only wind 



0.1 0.12 0.14 0.16 0.18 0.2 0.22 0.24 0.26 0.28 0.3 

(b) Density estimates of the variance of the fore- 
cast values. 

Figure 10: PITs and density estimates of forecast 
values for each of the four models fitted. 



degree of freedom and that each spline has approx- 
imately one less effective degree of freedom than the 
number of basis splines; this is due to the constraint 
that each spline is centred around zero. The effec- 
tive number of parameters for this model according 
to the DIG is 243.443. The effective number of de- 
grees of freedom for the linear predictor is 230.643 
(Table 5) . These two estimates of model complexity 
agree quite well given the degrees of freedom given 
up by the constraint on each spline. 

As described in Section 2.7, estimates of forecast 
values are obtained by modelling the observed val- 
ues, modelling the future value based on measured 
meteorological measurements (in reality, weather 
forecasts are available instead of observed values) 
and adding the autocorrelated noise based on the 
estimates of </>. The modelled values (for both the 
observed and future values) are conditioned on </> 
and so any inference performed on the forecasting 





(a) Posterior covariance (b) Posterior covariance 
of e of It 

Figure 13: Autocorrelation of residuals estimated 
from posterior samples of e and u. 



must be done on X/3 + e rather than the linear pre- 
dictor X/3. 



5. Conclusion 

This paper presents a regression method which 
combines semi-parametric regression, in the form of 
penalised splines, and a GLM with autoregressive 
residuals. 

It was shown in the simulation case study of sec- 
tion 4.1 that the method described is capable of ap- 
proximating the underlying smooth functions used 
to generate the data as well as the autoregressive 
noise which was added. The resulting smooth re- 
produced the underlying 2D function without the 
oscillations which characterise the use of B-splines 
with many basis functions and a small amount (e.g. 
none) of smoothing (Filers and Marx, 2010). 

In section 4.2 the modelling methodology was 
applied to some real data from Helsinki to infer 
the temporal trends and the effects of various me- 
teorological and physical phenomenon on ultrafine 
PNC. The resulting fits were consistent with pre- 
vious studies of ultrafine PNC in Helsinki (Clifford 
et al., 2011; M0lgaard et al., 2012) but provided 
fitted smooth functions which do not exhibit the 
oscillations typical of the use of a Fourier series ba- 
sis. 

Despite the models here having a basis size of 
more than 200 and a number of smoothing param- 
eters and AR parameters, convergence is fast due 
to the use of the Gibbs sampler and the sum to 
zero constraint which ensures identifiability. The 
first 200 samples were discarded for each block of 
model fitting and the 2000 samples used for pos- 
terior inference came from stable chains, yielding 
normal posterior densities (e.g. Figure 11a). 
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Term 


2.5% 


50% 


97.5% 


Mean 


SD 


Intercept, /3o 


f .00 


f .00 


1 A A 

f .00 


1 A A 

f .00 


A AA 

0.00 


Wind speed and direction 


46.54 


47.01 


47.41 


47.00 


0.22 


Solar radiation 


6.93 


6.97 


6.99 


6.97 


O.OI 


Annual and daily trend 


31.39 


32.42 


33.51 


32.42 


0.55 


Weekly trend 


6.00 


6.00 


6.00 


6.00 


0.00 


Temperature and wind direction 


38.14 


40.55 


43.28 


40.59 


1.37 


Traffic and wind direction 


48.30 


50.12 


51.99 


50.15 


0.96 


Relative humidity and wind direction 


44.22 


46.44 


48.92 


46.52 


1.18 


Total for entire model 


227.18 


230.66 


233.91 


230.64 


1.68 



Table 5: Effective degrees of freedom summary statistics 



By converting the univariate spline bases of the 
local meteorological covariates (traffic count, tem- 
perature, etc.) to a tensor basis of wind direction 
and the those covariates the DIG was reduced from 
13023 to 12404, indicating that the effect of these 
covariates is dependent on wind direction. This 
analysis was absent from Clifford et al. (201 1) and 
the move to a penalised B-spline basis improved 
the flexibility of the model over that described by 
M0lgaard et al. (2012). Fitting a model with the 
univariate bases and tensor bases for the covari- 
ates did not provide a qualitatively different fit and 
merely inflated the DIG. 

The concurrent modelling of the residuals showed 
a high level of autocorrelation in the residuals at lag 
1. The value of the autocorrelation parameters at 
lags 24 and 168 is small but non-zero and these 
lags correspond to modelling the variation in the 
residuals which is left after modelling the daily and 
weekly trends. Modelling this autocorrelation re- 
duced the magnitude of the posterior covariance of 
the residuals. 

Modelling with penalised splines captured the 
smooth temporal trends and the autoregressive 
model for the residuals explained the residual 
non-smooth variation. By combining these in 
the same model, rather than doing it in two 
steps, the MGMG sampler can trade the smooth 
trend and rough residuals off against each other; 
the posterior density for (3 contains X* — 
(1 — 4>{L)) X and the posterior for (p contains E = 
(y-X/3) (l,L,L^...,LP). Fitting this model in 
a two-step process (i.e. modelling the autocorrela- 
tion in the residuals post hoc) would not allow this 
trade-off. 

With respect to the methodology outlined in Sec- 
tion 2, the number of knots for each spline is fixed 



rather than allowing the number of knots in the 
splines to vary and using a reversible jump MGMG 
method (Billcr, 1998). The advice of Eilcrs and 
Marx (2010) is to use a large number of knots, per- 
haps more than is "necessary" , and to allow the 
smoothing penalty to control how wiggly the result- 
ing smooth fit is. Ruppcrt et al. (2003) and Wand 
(2003) suggest choosing the number of knots for a 
univariate spline as min(n/4, 35) (for n the number 
of unique values the covariate takes) to ensure that 
the non-linear features are captured. The equiv- 
alent degrees of freedom for the spline will gener- 
ally be substantially lower than the number of basis 
splines used. As such, pD will typically not increase 
as more knots are used, because the DIG (and pd) 
are calculated from the deviance (which stabilises 
as the basis size is increased). 

Modelling with tensor products of splines allows 
the investigation of the interaction of two covariates 
which may have non-linear effects and non-linear 
interactions without specifying a particular func- 
tional form, which may often be no better than a 
subjective guess. 

Adding random effects to a GLM converts it to 
a Generalised Linear Mixed Model. By analogy, 
random effects can be added to a GAM to form a 
Generahsed Additive Mixed Model (GAMM) (Lin 
and Zhang, 1999). A mixed effect spline can be 
formed by taking the tensor product of a univariate 
(or higher dimension) spline with an identity ma- 
trix representing the different groups for the mixed 
effect. As the basis vectors stay the same across 
mixed effect groups, there is no need to include mul- 
tiple copies of the basis. It is desirable to come up 
with a way to pass an argument to the model setup 
which allows the reuse of basis functions, similar 
to the by= argument in the R package mgcv Wood 
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Lagi 

(a) Lag 1 



g 50 
a 




0.05 0.055 0.06 0.065 0.07 0.075 

Lag 24 

(b) Lag 24 




0.025 0.03 0.035 O.or 0.045 0.05 

Lag 168 



(c) Lag 168 

Figure 14: Density plots of autoregression param- 
eters 4>. 95% credible intervals are represented as 
black rugplots. 



(2011). This would be a computational improve- 
ment over forming the tensor product of a spline 
and factor term. 

The method presented here provides flexible fit- 
ting of covariates which may have non-linear effects. 
The focus has been on temporal trends using cyclic 
B-spline basis functions with smoothing penalties 
to ensure there are no spurious oscillations. The an- 
ticipated autocorrelation of the residuals has been 
explicitly modelled to account for much of the tem- 
poral variation that remains after removing smooth 
trends. 
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